1. INTRODUCTION

We are very grateful to the Executive Editors George Casella and Edward George for their active interest in our paper and for organizing this challenging discussion. We also thank all the discussants for their insightful and stimulating comments.

When we submitted the original manuscript in 2003, we were tempted to go for a more general paper on kernel methods. We decided to focus on support vector machines (SVMs), waiting for a mature development of the new and exciting ideas related to kernel methods, such as manifold learning and other related topics. We will refer to some of these methods below. Let us begin, first, with some general considerations.

Regarding the question of the dimensionality induced by the feature space, Hastie and Zhu remark in their comment that usual kernels do not automatically lead to infinite-dimensional feature spaces. They give a nice example that involves the radial (Gaussian) kernel function. This agrees with results in Keerthi and Lin [8], where an explanation of the performance of the Gaussian kernel is given when, according to the notation in the comment by Hastie and Zhu, $\gamma \to 0$ and $\lambda$ is chosen in the appropriate way. In this case, the SVM classifier converges to a linear SVM classifier, and the effective dimension of the kernel is finite, agreeing with the empirical conclusion provided by the discussants.

We also agree with the assertions of some of the discussants regarding the probabilistic interpretability of the SVM output (the sign of some estimated function). Our comment was rather along the lines of Sollich [18], who proposed to make Bayesian methods available for the support vector methodology, while leaving as much as possible of the standard SVM framework intact. This is not an easy task. In fact, as Bartlett, Jordan and McAuliffe remark, sparseness and the precise estimation of conditional probabilities are hard to reconcile.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2006, Vol. 21, No. 3, 358-362. This reprint differs from the original in pagination and typographic detail.

Regarding the role of differentiability in SVMs (misplaced in the opinion of Bartlett, Jordan and McAuliffe), it is worth recalling that the differentiable formulation of the SVM problem allows it to be solved with standard Newton-type methods for convex optimization. When second-order derivatives are available (and this is the case for SVMs), these methods are known to be the most efficient ones for the solution of smooth problems.

We thank some of the discussants for drawing the reader's attention to general kernel methods. In particular, we appreciate the effort of Bartlett, Jordan and McAuliffe to clarify the potential impact of reproducing kernel Hilbert space (RKHS) methods. Regarding the origins of RKHS in statistics, for the sake of completeness, we strongly recommend reading the conversation with Emanuel Parzen in [14].

Given the history of SVMs, perfectly outlined by Wahba in the introduction of her comment, we do not like to think of SVMs as a "modest" variant of some standard statistical methodology (as suggested by Bartlett, Jordan and McAuliffe). Using a similar (a posteriori) reasoning, some strict mathematicians might think that RKHS methods in statistics are just a small variation on the general theory of Hilbert spaces. Of course, this is far from true. We rather think that the support vector methodology, followed closely by kernel methods, has been able to synthesize a variety of techniques from different fields, leading to a more unified framework for learning theory [5]. In addition, the geometrical viewpoint of SVMs allows new approaches to long-familiar problems, as illustrated in the next section.

2. KERNEL METHODS REVISITED 

One interesting point regarding the geometrical interpretation of SVMs is that it has spurred the development of new techniques driven by the geometrical properties of the kernel. Some of these techniques have not so far been mentioned in the discussion. We now briefly describe two relevant examples.
2.1 One-Class SVMs

An example of a new method that has arisen from a geometrical point of view is one-class SVMs [16]. One-class SVMs deal with a problem related to estimating high density regions from data samples. The method computes a binary function that takes the value $+1$ in "small" regions that contain most data points and takes the value $-1$ elsewhere. The strategy of the one-class support vector method is to map the data points into the feature space determined by a kernel function and to calculate a hyperplane that separates the mapped data $\{\Phi(x_j)\}_{j=1}^{n}$ from the origin, where $\Phi$ is the mapping induced by the kernel function. With this aim, the one-class SVM algorithm solves the quadratic optimization problem
where $\xi_j$ are slack variables, $\nu \in [0, 1]$ is an a priori fixed constant which represents the fraction of outlying points and $b$ is the decision value which determines whether a given point belongs to the estimated high density region. The decision function will take the form $h(x) = \mathrm{sign}(w^{*T} \Phi(x) - b^*)$, where $w^*$ and $b^*$ are the values of $w$ and $b$ at the solution of problem (2.1). The hyperplane $w^{*T} \Phi(x) - b^* = 0$ separates from the origin the mapped data for which the decision function $h(x) = +1$. Problem (2.1) is smooth and convex, and follows the SVM idea of building a hyperplane in a feature space.

It is apparent that solving the problem of estimating high density regions by building a separating hyperplane in a feature space is not trivial. Next, we provide an original statistical explanation of one-class SVMs. Consider the class of real-valued functions
where $X$ is the input space and $f$ is the data density function. To estimate the outlying points that correspond to the proportion $\nu$, all we have to do is use the order induced by any function $g \in \mathcal{G}$ on the data sample $\{x_1, \ldots, x_n\}$. This is equivalent to solving the optimization problem
where, at the solution $\lambda^* = (\lambda_1^*, \ldots, \lambda_n^*)^T$, $\lambda_j^* > 0$ if $x_j$ is an outlying point [i.e., $\lambda_j^* > 0$ for small values of $g(x_j)$] and $\lambda_j^* = 0$ otherwise. The dual of this linear problem is
s.t. $g(x_i) \ge b - \xi_i$, $i = 1, \ldots, n$,

$\xi_i \ge 0$, $i = 1, \ldots, n$.

It can be shown that, at the solution, the values of 
the objective functions of problems (2.3) and (2.4) 
coincide (see, e.g., [1]). Moreover, the solution of 
problem (2.3) can be straightforwardly calculated 
from the solution of problem (2.4). 
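
The relation between ordering the sample by $g$ and the pair of linear problems can be checked numerically. The following Python sketch assumes, purely for illustration, that problem (2.3) minimizes $\sum_i \lambda_i g(x_i)$ subject to $\sum_i \lambda_i = \nu n$ and $0 \le \lambda_i \le 1$, and that problem (2.4) maximizes $\nu n b - \sum_i \xi_i$ under the constraints $g(x_i) \ge b - \xi_i$, $\xi_i \ge 0$. The scores in `g_values`, the value of `nu` and the function names are hypothetical, and $\nu n$ is taken to be an integer.

```python
def primal_value(g_values, nu):
    """Assumed form of (2.3): min sum(lam_i * g_i), sum(lam_i) = nu*n, 0 <= lam_i <= 1.

    For integer nu*n the optimum sets lam_i = 1 on the nu*n smallest scores,
    which is exactly the ordering argument used in the text.
    """
    n = len(g_values)
    k = round(nu * n)
    order = sorted(range(n), key=lambda i: g_values[i])
    outliers = sorted(order[:k])
    return sum(g_values[i] for i in outliers), outliers


def dual_value(g_values, nu):
    """Assumed form of (2.4): max nu*n*b - sum(xi_i), g_i >= b - xi_i, xi_i >= 0.

    For fixed b the best slack is xi_i = max(0, b - g_i). The resulting objective
    is concave and piecewise linear in b, so a maximizer lies at one of the scores.
    """
    n = len(g_values)
    k = round(nu * n)

    def objective(b):
        return k * b - sum(max(0.0, b - g) for g in g_values)

    b_star = max(g_values, key=objective)
    return objective(b_star), b_star


# Hypothetical scores g(x_i) and an assumed outlier fraction.
g_values = [0.9, 0.2, 1.4, 0.05, 1.1, 0.7, 1.3, 0.6, 1.0, 0.8]
nu = 0.2

p_val, outliers = primal_value(g_values, nu)
d_val, b_star = dual_value(g_values, nu)
print("outlying indices:", outliers)
print("primal value:", p_val, "dual value:", d_val, "decision value b:", b_star)
```

With these scores the two points with the smallest values of $g$ (indices 1 and 3) are flagged as outlying, and both objective values equal 0.25 up to rounding.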

The one-class SVM problem (2.1) and problem (2.4) are very similar. It becomes clear now that the solution of problem (2.1) (the one-class SVM solution) tries to estimate a function $g \in \mathcal{G}$ by $g(x) = w^{*T} \Phi(x)$, that is, by estimating in the feature space the weights $w^*$ of a hyperplane with minimum norm. This is achieved through the inclusion of the term $\frac{1}{2}\|w\|^2$ in the objective function of problem (2.1).

Appropriate mappings and kernels for estimating high density regions (density level sets) with one-class SVMs are derived in [13].

2.2 Combination of Kernels 

Another example of a technique developed from geometric considerations of the kernel is now illustrated. This method falls in the category of "further advances" mentioned by Bousquet and Schölkopf at the end of their comment. In particular, we build, for classification purposes, what they call a joint kernel (mixing inputs and outputs). This joint kernel is built by the combination of a set of kernels. A key point of our proposal is that the constructed kernel tries to capture the "right" notion of similarity. This agrees with the comment by Bousquet and Schölkopf about the relationship between the good performance of SVMs in practice and the appropriate prior knowledge about the problems incorporated by kernels. Thus, we will work with similarity matrices instead of kernel matrices. In fact, as Wahba points out, Euclidean distances (and therefore similarities) can be derived from positive definite kernels.

The idea, in geometric terms, is introduced next. If kernels are being used, points in a sufficiently small neighborhood in the feature space should belong to the same class (excluding points very close to the decision surface). As a consequence, if we are going to classify a data set by relying on a given similarity matrix, points close to each other using such similarities should, in general, be in the same class. Therefore, we have to construct a similarity matrix $K^*$ with entries $K^*(x_i, x_j)$ that are large for $x_i$ and $x_j$ in the same class (i.e., $y_i = y_j$), and small for $x_i$ and $x_j$ in different classes. For instance, if two kernels $K_1$ and $K_2$ are to be combined, a possible choice is

(2.5) $K^*(x_i, x_j) = \begin{cases} \max(K_1(x_i, x_j), K_2(x_i, x_j)), & \text{if } y_i = y_j, \\ \min(K_1(x_i, x_j), K_2(x_i, x_j)), & \text{otherwise.} \end{cases}$

It is immediate to show that (2.5) is equivalent to

(2.6) $K^* = \frac{1}{2}(K_1 + K_2) + \frac{1}{2} Y |K_1 - K_2| Y$,

where the absolute value is taken entrywise and $Y = \mathrm{diag}(y)$ is a diagonal matrix whose nonzero elements are the data labels, that is, $y_i \in \{-1, +1\}$. Let $K_1, K_2, \ldots, K_M$ be the available set of $M$ input kernel matrices, all of which are obtained from the same data sample $\{x_1, \ldots, x_n\}$. The extension of the previous idea to the combination of more than two kernel matrices is

(2.7) $K^* = \bar{K} + Y \sum_{i<j} g(K_i - K_j) Y$,

where $\bar{K}$ is the average of the kernel matrices and $g$ is a function that quantifies the difference of information between kernel matrices. The function $g$ must have the property that if $K_i$ and $K_j$ tend to produce the same classification results, then $g(K_i - K_j)$ should be almost null. A particular case of the previous equation is

(2.8) $K^* = \bar{K} + Y \sum_{m=1}^{M} |K_m - \bar{K}| Y$.

This and other choices for g can be consulted in [12]. 
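
The equivalence between the entrywise rule (2.5) and its matrix form (2.6) can be checked on a toy example. The Python sketch below is only an illustration: the matrices `K1` and `K2`, the labels `y` and the function names are hypothetical, and the absolute value in (2.6) is taken entrywise.

```python
def av_pairwise(K1, K2, y):
    """Rule (2.5): entrywise max for same-class pairs, entrywise min otherwise."""
    n = len(y)
    return [[max(K1[i][j], K2[i][j]) if y[i] == y[j] else min(K1[i][j], K2[i][j])
             for j in range(n)] for i in range(n)]


def av_matrix_form(K1, K2, y):
    """Form (2.6): 0.5 * (K1 + K2) + 0.5 * Y |K1 - K2| Y with Y = diag(y).

    Since Y is diagonal, (Y A Y)[i][j] = y[i] * A[i][j] * y[j].
    """
    n = len(y)
    return [[0.5 * (K1[i][j] + K2[i][j]) + 0.5 * y[i] * abs(K1[i][j] - K2[i][j]) * y[j]
             for j in range(n)] for i in range(n)]


# Hypothetical similarity matrices and labels for three points.
K1 = [[1.0, 0.8, 0.3], [0.8, 1.0, 0.5], [0.3, 0.5, 1.0]]
K2 = [[1.0, 0.6, 0.4], [0.6, 1.0, 0.2], [0.4, 0.2, 1.0]]
y = [1, 1, -1]

K_rule = av_pairwise(K1, K2, y)
K_form = av_matrix_form(K1, K2, y)
max_gap = max(abs(K_rule[i][j] - K_form[i][j]) for i in range(3) for j in range(3))
print("largest entrywise difference:", max_gap)
```

The identity behind the check is that $\frac{1}{2}(a + b) + \frac{1}{2}|a - b| = \max(a, b)$ and $\frac{1}{2}(a + b) - \frac{1}{2}|a - b| = \min(a, b)$, with the sign selected by $y_i y_j$.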

Next, we show how this method (denoted AV for absolute value) can be used to improve the performance of single kernels. With this aim, we will use the breast cancer data set, made up of 683 observations with 9 features each [11]. We have considered three kernels: a polynomial kernel $K_1(x, z) = (1 + x^T z)^2$, a Gaussian kernel $K_2(x, z) = \exp(-\|x - z\|^2)$ and a linear kernel $K_3(x, z) = x^T z$. We will compare SVMs using these kernels with the AV combination method and a semidefinite programming (SDP) technique for building linear combinations of kernels developed by Lanckriet et al. [9]. The data set has been randomly partitioned ten times into a training set and a test set, and for each method, a run of the experiment has been done over each partition. The average results are shown in Table 1. The AV method provides the best results (a test error of 3.1%), using significantly fewer support vectors than the other methods. The SDP method improves only the results of the Gaussian and the polynomial kernel.

Table 1
Percentage of misclassified data and support vectors for the cancer data. Standard deviations in parentheses

                  Training error   Test error    % SV
K1: Polynomial    0.1 (0.1)        7.8 (2.5)     8.3 (0.8)
K2: Gaussian      0.0 (0.0)        10.8 (1.7)    65.6 (1.0)
K3: Linear        2.6 (0.5)        3.7 (1.8)     7.1 (0.8)
AV                2.4 (0.3)        3.1 (1.3)     2.9 (0.4)
SDP               0.0 (0.0)        6.2 (1.6)     65.5 (1.9)

J. M. MOGUERZA AND A. MUÑOZ

2.2.1 Parameter selection. Techniques for the combination of kernels can be successfully applied to the problem of parameter selection in kernel methods. This links with the comment of Bousquet and Schölkopf about the need for further research on satisfactory alternatives, other than cross-validation, to choose the parameters in kernel methods. We illustrate this situation using a collection of Gaussian kernels on the cancer data set. Let $\{K_1, \ldots, K_{12}\}$ be a set of Gaussian kernels $K_c(x, y) = \exp(-\|x - y\|^2 / c)$ with parameters $c$ = 0.1, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100, respectively. This wide set covers a realistic range of possible values for the kernel parameter. The test errors for 12 SVMs using these Gaussian kernels range from 3.1% to 24.7%. In this case, the AV method, combining the 12 Gaussian kernels, matches the best result obtained using any single one of the Gaussian kernels under consideration (a test error of 3.1%). It is important to note that the performance of the AV method (which is parameter-free) is not affected by the inclusion of kernels with a bad generalization performance. Since, in general, the best parameter choice is not known in advance, the methodology just described provides an alternative that minimizes the effect of bad parameter selection.
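
The construction just described can be sketched in a few lines. The Python fragment below builds the twelve Gaussian Gram matrices for the parameter values listed above and combines them with formula (2.8) on a labeled training sample. The data `X`, the labels `y` and the function names are hypothetical, the exponent is assumed to use the squared Euclidean norm, and how the labels are handled for unlabeled test points is outside the scope of this sketch.

```python
import math


def gaussian_gram(X, c):
    """Gram matrix of K_c(x, y) = exp(-||x - y||^2 / c) (squared norm assumed)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / c) for xj in X]
            for xi in X]


def av_combine(kernels, y):
    """Formula (2.8): K_bar + Y * sum_m |K_m - K_bar| * Y, absolute value entrywise."""
    n, M = len(y), len(kernels)
    K_bar = [[sum(K[i][j] for K in kernels) / M for j in range(n)] for i in range(n)]
    return [[K_bar[i][j] + y[i] * y[j] * sum(abs(K[i][j] - K_bar[i][j]) for K in kernels)
             for j in range(n)] for i in range(n)]


# Kernel parameters listed in the text.
C_VALUES = [0.1, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Hypothetical toy training sample with two classes.
X = [[0.0, 0.0], [0.5, 0.0], [3.0, 3.0], [3.5, 3.0]]
y = [1, 1, -1, -1]

kernels = [gaussian_gram(X, c) for c in C_VALUES]
K_star = av_combine(kernels, y)
for row in K_star:
    print([round(v, 3) for v in row])
```

By construction, entries for same-class pairs are never below the average kernel and entries for different-class pairs are never above it, whichever values of $c$ are included in the collection.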

3. THE BIAS-VARIANCE PROBLEM 

Regarding the comments on statistical consistency provided by Bartlett, Jordan and McAuliffe, also pointed out by Bousquet and Schölkopf, we agree that the Vapnik-Chervonenkis (VC) dimension is not central in the analysis of SVMs. In fact, in [6], the bias-variance problem is analyzed in the context of regularization for the quadratic loss function (the analysis by Steinwart cited by the discussants is for the $L_1$-SVM, as Bousquet and Schölkopf remark). Cucker and Smale [6] replaced the VC dimension by the radius $r$ of a ball in an RKHS ($r$ is the norm in the RKHS of the minimizer of the empirical risk). Since the regularization parameter $\lambda$ (using the notation of Bartlett, Jordan and McAuliffe) is inversely proportional to $r$, large values of $\lambda$ correspond to large bias, while small values of $\lambda$ lead to large variance. The Cucker and Smale paper [6] also contains a result (Corollary 2) that is in agreement with the discussants' comment about the fact that the regularization coefficient must decrease with the sample size. In addition, statistical consistency can be derived from the results in the paper if a rich enough kernel is used (i.e., a universal kernel, in the sense of Steinwart).
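
The trade-off governed by $\lambda$ can be illustrated with regularized least squares in an RKHS. The Python sketch below is a toy illustration, not a reproduction of the analysis in [6]: it assumes the estimator $\min_f \frac{1}{n}\sum_i (f(x_i) - y_i)^2 + \lambda \|f\|_K^2$, whose solution is $f = \sum_i c_i K(x_i, \cdot)$ with $(K + \lambda n I)c = y$, together with a Gaussian kernel and hypothetical one-dimensional data.

```python
import math


def gaussian_gram(X, c=1.0):
    """Gram matrix of a Gaussian kernel on scalar inputs (kernel width is an assumption)."""
    return [[math.exp(-(a - b) ** 2 / c) for b in X] for a in X]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][k] * z[k] for k in range(i + 1, n))) / M[i][i]
    return z


def regularized_fit(K, y, lam):
    """Return (RKHS norm of the fit, training mean squared error) for a given lambda."""
    n = len(y)
    A = [[K[i][j] + (lam * n if i == j else 0.0) for j in range(n)] for i in range(n)]
    coef = solve(A, y)
    fitted = [sum(K[i][j] * coef[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(coef[i] * fitted[i] for i in range(n)))  # sqrt(c^T K c)
    mse = sum((fitted[i] - y[i]) ** 2 for i in range(n)) / n
    return norm, mse


# Hypothetical inputs and responses.
X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.1, 0.6, 0.9, 1.0, 0.8, 0.3]
K = gaussian_gram(X)

lambdas = (0.001, 0.1, 10.0)
results = [regularized_fit(K, y, lam) for lam in lambdas]
for lam, (norm, mse) in zip(lambdas, results):
    print(f"lambda={lam}: RKHS norm={norm:.3f}, training MSE={mse:.4f}")
```

As $\lambda$ grows, the RKHS norm of the fitted function shrinks and the training error rises, which is the large-bias regime; small $\lambda$ gives a large-norm fit that follows the sample closely, the regime associated with large variance.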

4. DIFFERENTIAL GEOMETRY METHODS AND KERNEL METHODS

In her comment, Wahba introduces a particular method for learning the kernel data matrix for the purpose of manifold unfolding. As Wahba and her co-authors remark in [10], the manifold unfolding problem is closely related to the construction of a kernel. In fact, Ham, Lee, Mika and Schölkopf [7] showed that several of the proposed techniques for manifold learning, namely ISOMAP [19], graph Laplacian eigenmap [2] and locally linear embedding [15], can be interpreted as particular cases of kernel principal component analysis [17]. As Ham and co-authors point out, these techniques can be viewed as a "warping of the input space into a feature space where the manifold is flat."

In this regard, Burges [4] described the intrinsic geometry of the manifold which arises for a particular choice of the kernel. In particular, he shows that the Riemannian metric induced on the manifold by its embedding can be expressed, in terms of the kernel, in closed form. A closely related approach can be found in [20]. It is worth mentioning that the implicit geometric assumption within manifold unfolding is that the decision surface (for the case of classification) is smooth with respect to the underlying geometry [3].

Finally, we would like to thank David Rios and Francisco J. Prieto for their careful reading of the manuscript and suggestions. We hope that our paper and this interesting discussion encourage the statistical community to pursue further research on support vector machines and other related methodologies.